Mechanism to Calculate Probability of a Cyber Security Incident

ABSTRACT

An Archetype Software Invention which calculates the probability of a cyber security incident for a given computer by correlating the distribution of computer program files with the occurrences of security incidents across a large number of computers.

BACKGROUND

Field of Invention

This Invention relates to computer applications which will protect a corporate enterprise from security incidents, including unauthorized intrusions and malicious computer programs.

Description of Prior Art

The foundation of a good cyber security policy for any corporate or government enterprise is a security risk assessment: the probability of a security incident and the impact if it were to occur. The amount of risk that can be tolerated and how to mitigate the risk can be determined based upon the risk assessment.

A security risk assessment is difficult to perform, due in part to the difficulty of assessing the probability that a security incident could occur. Current methods amount to a subjective rating of known vulnerabilities for an enterprise. The ISO 27000 standards even recommend that several people perform the analysis and that their opinions be averaged. Current methods are also manual, laborious and time-consuming, and are therefore performed infrequently.

OBJECTS AND ADVANTAGES

Accordingly, we claim the following as our objects and advantages of our invention:

-   1. To objectively estimate the probability of a security incident based upon a statistical correlation of program files present on a computer with security incidents.
-   2. To automatically and continuously calculate the probability of a security incident.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1, System diagram

FIG. 2, Database Schema of the system.

LIST OF OBJECTS IN FIGURES

-   10 Computers which have been previously involved in a security incident
-   20 A computer which is part of the collection of computers under analysis
-   30 The software agent which collects information about program files
-   40 File system of the computer
-   50 Common database with information about program files and computers
-   60 Computer to analyze data in common database and calculate probability
-   100 Schema for database 50
-   110 Database table containing identification information for each computer within the collection analyzed
-   120 Database table joining computers 110 with program files 140 and directories 130
-   130 Database table containing the names of all directories on all computers 110
-   140 Database table containing information on all program files on all computers 110
-   150 Database table containing primary keys and time of each analysis
-   160 Database table containing primary keys for groups used to analyze program file distributions in order to form program file bundles
-   170 Database table linking computers to groups
-   180 Database table containing the number of times each program file is found within the computers within each group, used to form program file bundles
-   190 Database table containing primary keys for program file bundles
-   200 Database table containing probability values for each program bundle at the time of an analysis
-   210 Database table linking program files 140 to bundles 190 for any particular analysis 150
-   220 Database table containing the final result of probability values for each computer 110 at the time of each analysis 150

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The probability of a security breach is calculated by analysis of the program files present on a collection of computers. The collection of computers is large enough that some of the computers have previously been involved in a security incident 10, for example, infected by malware. Each computer 20 has a software agent 30 which reads program files on disk drives 40 attached to the computer. The agent may be a Windows NT Service in the case of a Windows operating system, or a daemon in the case of a Linux operating system. The agent performs a checksum calculation and sends the program file name, directory and checksum to a database 50 with schema 100 over the internet using, for example, the TCP/IP or HTTP protocol. A computer 60 reads the information in the database, calculates a probability for each computer, and saves the result back into the database. The probability for each computer can then be read from the database in order to perform a risk assessment.
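The collection step performed by the agent can be illustrated with a minimal Python sketch. It is not the agent described here: the SHA-256 checksum, the set of extensions treated as program files, and the function names `file_checksum` and `inventory` are all assumptions made for illustration, since the description does not specify a checksum algorithm or a definition of "program file".

```python
import hashlib
import os

# Hypothetical: extensions treated as "program files" (not defined in the source).
PROGRAM_EXTENSIONS = {".exe", ".dll", ".sys"}


def file_checksum(path, chunk_size=65536):
    """Return a hex SHA-256 digest of a file (the hash choice is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Walk `root` and return a set of (directory, filename, size, checksum)."""
    records = set()
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            if extension not in PROGRAM_EXTENSIONS:
                continue
            path = os.path.join(directory, name)
            try:
                size = os.path.getsize(path)
                records.add((directory, name, size, file_checksum(path)))
            except OSError:
                # Skip files that cannot be read or that vanish mid-scan.
                continue
    return records
```

Each record carries the same fields the agent is described as sending to database 50: file name, directory, size and checksum.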

Operation

The Invention utilizes a statistical approach when analyzing program files. It is assumed that enough computers are analyzed that a sufficient number of them have previously been involved in a security incident 10 for an accurate statistical analysis to be performed.

Computer Registration

The operations described here, which are performed by the Agent program 30 located on each networked computer 20, can be performed within many different operating systems (for example Linux, Unix, Mac OS, the various Windows operating systems, and Google Android), but will be described here for the Windows 7 operating system.

The very first time the Agent program 30 starts, it calculates a unique number (GUID) which it then stores locally, for example, in the registry. The new GUID will be stored in the COM_GUIDIdentifier column of the COM_Computer table 110. This GUID will be used in all communications from the Agent to the Database Publisher to identify the computer.
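A minimal sketch of the first-run GUID step, in Python, follows. It assumes the GUID is persisted in a plain text file rather than the Windows registry, purely to keep the example portable; the function name `load_or_create_guid` and the file path are hypothetical.

```python
import os
import uuid


def load_or_create_guid(path):
    """Return the GUID stored at `path`, generating and saving one on first run."""
    if os.path.exists(path):
        with open(path, "r", encoding="ascii") as handle:
            return handle.read().strip()
    guid = str(uuid.uuid4())  # random (version 4) GUID
    with open(path, "w", encoding="ascii") as handle:
        handle.write(guid)
    return guid
```

Calling the function again on later runs returns the same value, so the computer keeps one identity across all communications.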

File Registration

The principal task of the Agent program is to ensure that the information in tables 120, 130, and 140 accurately represents the program files which can be found in the computer's file system.

To accomplish this, the Agent can periodically inventory the file system. The Agent begins an inventory by connecting to the database and downloading a local list of program files and directories. This list contains the filename, file size, file checksum, and directory for all the program files which were present when the Agent program last ran. This list can be generated by joining the COM_Computer table 110 with the COP_ComputerPathFile 120, FIS_File 140 and DIR_Directory 130 tables and filtering on the COM_GUIDIdentifier column, as follows:

Script 1, Return a local file list

    -- Program files registered for the computer identified by @GUID
    SELECT COP_COM_ComputerID
          ,FIS_FileID
          ,DIR_DirectoryID
          ,FIS_FileName
          ,FIS_FileSize
          ,FIS_FileChecksum
          ,DIR_Directory
    FROM COP_ComputerPathFile
    INNER JOIN FIS_File ON FIS_FileID = COP_FIS_FileID
    INNER JOIN DIR_Directory ON DIR_DirectoryID = COP_DIR_DirectoryID
    INNER JOIN COM_Computer ON COM_ComputerID = COP_COM_ComputerID
    WHERE COM_GUIDIdentifier = @GUID

Next, after the local list is downloaded, the Agent performs an inventory of the file system, comparing the program files found with the program files in the list. For program files found in the file system but not in the list, the Agent creates an entry in the COP_ComputerPathFile table which links the corresponding file entry in the FIS_File table, the corresponding directory in the DIR_Directory table and the corresponding computer in the COM_Computer table. If the file or directory has not yet been registered, the Agent can first create the FIS_File and DIR_Directory entries.

For entries in the list that cannot be found in the file system, the Agent can delete the corresponding row in the COP_ComputerPathFile table.
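The comparison between the downloaded list and the file system amounts to two set differences. The Python sketch below illustrates this under the assumption that each program file is represented as a (directory, filename, checksum) tuple; the function name `diff_inventory` is hypothetical.

```python
def diff_inventory(registered, found):
    """Compare the downloaded list with the files found on disk.

    Both arguments are sets of (directory, filename, checksum) tuples.
    Returns (to_add, to_delete): entries to create in, and to delete
    from, the computer-to-file link table.
    """
    to_add = found - registered      # on disk but not yet registered
    to_delete = registered - found   # registered but no longer on disk
    return to_add, to_delete


# Illustrative data only.
registered = {("C:\\Apps", "tool.exe", "aa11"), ("C:\\Apps", "old.dll", "bb22")}
found = {("C:\\Apps", "tool.exe", "aa11"), ("C:\\Apps", "new.dll", "cc33")}
to_add, to_delete = diff_inventory(registered, found)
```

Because the checksum is part of the tuple, a file whose contents change appears as one deletion plus one addition.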

Security Incidents

When a computer is involved in a security incident, the COM_Incident bit can be set for that computer in the COM_Computer table. A security incident might include a detected break-in or malicious software which is found within the file system. Malicious software might be found by periodically scanning the FIS_File table for known malware, then identifying the computers which contain that software by linking the FIS_File table to the COM_Computer table through the COP_ComputerPathFile table and filtering by the FIS_FileID of the known malware.

Once the COM_Incident bit is set, it will not be unset even if the malicious file is removed from the computer.
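The one-way behaviour of the incident bit can be sketched in Python as follows. The dictionaries stand in for the database tables, and the function name `update_incident_flags` is hypothetical; this is an illustration of the rule, not the stored implementation.

```python
def update_incident_flags(incident_flags, computer_files, known_malware):
    """Set the incident flag for every computer holding a known-malware file.

    incident_flags: dict of computer id -> bool (updated in place)
    computer_files: dict of computer id -> set of file ids currently present
    known_malware:  set of file ids identified as malicious
    A flag that is already True is never cleared.
    """
    for computer_id, files in computer_files.items():
        if files & known_malware:
            incident_flags[computer_id] = True
        else:
            # Keep any earlier True value; default new computers to False.
            incident_flags.setdefault(computer_id, False)
    return incident_flags
```

A computer flagged during one scan therefore stays flagged on later scans, even when the malicious file no longer appears in its file list.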

Analysis: Calculate Group Values for Each File

With program files cataloged and computers involved in security incidents identified, a statistically based analysis is performed. Each time an analysis is performed, a new row is added to the ANA_Analysis table 150 and the ANA_Date is set to the current date and time.

Analysis begins by dividing computers into groups of one or more computers. The purpose of the groups is to identify files with identical distribution patterns so that these files can be treated as a single program collection, or bundle, during the correlation analysis. Groups can contain more computers to speed analysis, or fewer computers to increase sensitivity in identifying program files with similar distribution profiles.

When a group is formed, a row is added to the GRP_Group table 160 with a link to the ANA_Analysis table through the GRP_ANA_AnalysisID foreign key. A row is entered into the GRC_GroupComputer table 170 for each computer which is part of this group, thus linking the computer to the group.

The analysis continues by counting the number of computers in each group on which each file can be found. For each file, a row is entered in the GRF_GroupFileValue table 180 and GRF_Value is set to the number of times the file is found on the computers within that group. Because many files can be found multiple times on a computer, it is essential to count a file no more than once per computer. The following two-step SQL script can be used to set GRF_Value if a row has already been inserted in the GRF_GroupFileValue table, where @GroupID and @FileID are variables for the GRP_Group and FIS_File table primary keys:

Script 3, Calculate GRF_Value

    -- Step 1: list each computer in the group that holds the file, once per computer
    SELECT DISTINCT COP_COM_ComputerID
    INTO #TempCountFile
    FROM GRC_GroupComputer
    INNER JOIN COP_ComputerPathFile ON COP_COM_ComputerID = GRC_COM_ComputerID
    WHERE GRC_GRP_GroupID = @GroupID
      AND COP_FIS_FileID = @FileID

    -- Step 2: store the number of such computers as the group value
    UPDATE GRF_GroupFileValue
    SET GRF_Value = (SELECT COUNT(COP_COM_ComputerID) FROM #TempCountFile)
    WHERE GRF_FIS_FileID = @FileID
      AND GRF_GRP_GroupID = @GroupID

For a given file and a given analysis, the collection of GRF_Value values forms the distribution profile.
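The counting rule and the resulting distribution profile can be illustrated with a short Python sketch. In-memory dictionaries stand in for the database tables, and the names `group_file_values` and `distribution_profile` are hypothetical.

```python
from collections import Counter


def group_file_values(groups, computer_files):
    """Count, per group, the number of computers on which each file is found.

    groups:         dict of group id -> list of computer ids
    computer_files: dict of computer id -> list of file ids (duplicates allowed,
                    since one file may sit in several directories)
    Returns a dict of group id -> Counter mapping file id -> count.
    """
    values = {}
    for group_id, computers in groups.items():
        counts = Counter()
        for computer_id in computers:
            # set() ensures a file is counted at most once per computer.
            counts.update(set(computer_files.get(computer_id, ())))
        values[group_id] = counts
    return values


def distribution_profile(values, file_id, group_order):
    """Return the file's group values as a tuple, one entry per group."""
    return tuple(values[group_id][file_id] for group_id in group_order)
```

A file that is absent from a group contributes a zero to its profile, because a `Counter` returns 0 for missing keys.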

Analysis: Organize Files into Program File Bundles

Once GRF_Value values are calculated for each group for each file, the files are ready to be organized into Program File Bundles. A Program Bundle here connotes a collection of program files with the same distribution profile.

Similarity in distribution profiles is evaluated using Equation 1, where d_(ab) is the distance between files f_(a) and f_(b), N is the number of groups, G_(nfa) and G_(nfb) are the GRF_Value values for group n, file f_(a) and file f_(b) respectively.

$\begin{matrix} {{d_{ab} = \sqrt{\sum\limits_{n = 0}^{N}\left( {G_{{nf}_{a}} - G_{{nf}_{b}}} \right)^{2}}},} & {{Equation}\mspace{14mu} 1} \end{matrix}$

The following SQL View can be used to calculate Equation 1, and allows filtering by file and analysis.

Script 4, Calculate Distance between Files

    CREATE VIEW v_DistanceBetweenFiles AS
    -- One row per analysis and pair of files, summed over the groups of that analysis
    SELECT GRP_ANA_AnalysisID
          ,FileOne.GRF_FIS_FileID AS GRF_FileOne_FIS_FileID
          ,FileTwo.GRF_FIS_FileID AS GRF_FileTwo_FIS_FileID
          ,SQRT(SUM((FileTwo.GRF_Value - FileOne.GRF_Value)
                  * (FileTwo.GRF_Value - FileOne.GRF_Value))) AS Distance
    FROM GRF_GroupFileValue FileOne
    INNER JOIN GRF_GroupFileValue FileTwo ON FileTwo.GRF_GRP_GroupID = FileOne.GRF_GRP_GroupID
    INNER JOIN GRP_Group ON GRP_GroupID = FileOne.GRF_GRP_GroupID
    GROUP BY GRP_ANA_AnalysisID
            ,FileOne.GRF_FIS_FileID
            ,FileTwo.GRF_FIS_FileID
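For readers who prefer to see Equation 1 outside SQL, the following Python sketch computes the same Euclidean distance between two distribution profiles. The function name `distance` and the tuple representation of a profile are assumptions for illustration.

```python
import math


def distance(profile_a, profile_b):
    """Euclidean distance between two distribution profiles (Equation 1).

    Each profile is a sequence of group values, one entry per group,
    with both sequences listing the groups in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same groups")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))
```

Two files with identical group values in every group have a distance of exactly zero.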

When a set of files is found where d_(ab) is zero between every pair of files, the set is matched against the existing Program File Bundles. If an existing Program File Bundle is found with which at least 50% of the files are deemed in common, that bundle is reused; otherwise a new row is inserted in the BUN_Bundle table 190. A new row is then inserted in the BUA_BundleAnalysis table 200, and new rows are inserted into the BUF_BundleAnalysisFile table 210 with the primary key of the new or existing Program File Bundle 190, the primary keys of the files 140, and the primary key of the current analysis 150. In this way, each Program File Bundle will contain one or more files for one or more analyses.
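A minimal Python sketch of bundle formation follows. Since a distance of zero means two profiles are identical, files can be grouped by profile directly. The overlap measure used to match an existing bundle (the share of the new set's files already in that bundle) is an assumption, because the description does not say how the 50% figure is computed; the names `form_bundles` and `match_existing_bundle` are hypothetical.

```python
from collections import defaultdict


def form_bundles(profiles):
    """Group files whose distribution profiles are identical (distance zero).

    profiles: dict of file id -> sequence of group values
    Returns a sorted list of bundles, each a sorted list of file ids.
    """
    by_profile = defaultdict(list)
    for file_id, profile in profiles.items():
        by_profile[tuple(profile)].append(file_id)
    return sorted(sorted(members) for members in by_profile.values())


def match_existing_bundle(new_files, existing_bundles, threshold=0.5):
    """Return the id of the best-overlapping existing bundle, or None.

    Assumption: overlap is the fraction of `new_files` found in the bundle.
    """
    new_files = set(new_files)
    if not new_files:
        return None
    best_id, best_overlap = None, 0.0
    for bundle_id, members in existing_bundles.items():
        overlap = len(new_files & set(members)) / len(new_files)
        if overlap >= threshold and overlap > best_overlap:
            best_id, best_overlap = bundle_id, overlap
    return best_id
```

When `match_existing_bundle` returns `None`, the caller would create a new bundle, which corresponds to inserting a new row in the BUN_Bundle table.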

Analysis: Calculate Probability for Program File Bundles

A probability value is calculated for each Program File Bundle using Equation 2, where: P_(b,a) is the probability value for bundle b at the time of analysis a; I_(b,a) is the number of once-infected computers that have bundle b at the time of analysis a (i.e. computers where the COM_Incident bit has been set to 1 and that also contain the files which form bundle b); C_(b,a) is the total number of computers with bundle b at the time of analysis a; I_(a) is the total number of once-infected computers at the time of analysis a (i.e. all computers where the COM_Incident bit has been set to 1); C_(a) is the total number of computers at the time of analysis a; and C̄_(B,a) is the average number of computers per bundle, across all bundles B at the time of analysis a.

$\begin{matrix} {P_{b,a} = {\left( {\frac{I_{b,a}}{C_{b,a}} - \frac{I_{a}}{C_{a}}} \right) \times \frac{C_{b,a}}{{\overset{\_}{C}}_{B,a}}}} & {{Equation}\mspace{14mu} 2} \end{matrix}$

Equation 2 can be understood by replacing its three ratios with D, E and F (Equation 3). Ratio D = I_(b,a)/C_(b,a) is the ratio of once-infected to total computers for bundle b, and ratio E = I_(a)/C_(a) is the ratio of once-infected to all computers. When a bundle has a better (smaller) ratio D than the ratio E for computers overall, D − E will be negative and the effect of the bundle will be to lower the probability of infection for any computer where it appears. Ratio F = C_(b,a)/C̄_(B,a), the number of computers where bundle b appears relative to the average for a bundle, is a measure of how widely distributed a bundle is, and thus gives more weight to a bundle which is more widely distributed.

P_(b,a) = (D − E) × F        Equation 3

P_(b,a) is then calculated for all bundles b for analysis a, and the BUA_Probability value is updated in the BUA_BundleAnalysis table 200.
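Equation 2 can be checked numerically with a short Python sketch. The function and parameter names are hypothetical, and the figures in the example are invented purely to show the arithmetic.

```python
def bundle_probability(infected_with_bundle, computers_with_bundle,
                       infected_total, computers_total,
                       mean_computers_per_bundle):
    """Evaluate Equation 2 for one bundle at one analysis.

    infected_with_bundle      -> I_(b,a)
    computers_with_bundle     -> C_(b,a)
    infected_total            -> I_(a)
    computers_total           -> C_(a)
    mean_computers_per_bundle -> average number of computers per bundle
    """
    d = infected_with_bundle / computers_with_bundle        # ratio D
    e = infected_total / computers_total                    # ratio E
    f = computers_with_bundle / mean_computers_per_bundle   # ratio F
    return (d - e) * f


# Invented figures: 3 of the 10 computers holding the bundle were once
# infected, against 50 of 1000 overall, with 20 computers per bundle on average.
example = bundle_probability(3, 10, 50, 1000, 20)
```

Here D = 0.3, E = 0.05 and F = 0.5, so the value is approximately (0.3 − 0.05) × 0.5 = 0.125. A bundle with D smaller than E would give a negative value, as described for Equation 3.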

Analysis: Calculate Probability for Computers

Finally, a probability can be calculated for each computer. The probability for a computer c is calculated by summing the probabilities of all Program File Bundles which can be found on a computer at the time of analysis a (Equation 4).

$\begin{matrix} {P_{c,a} = {\sum\limits_{c,a}P_{b,a}}} & {{Equation}\mspace{14mu} 4} \end{matrix}$

The P_(c,a) value can be saved in the ANC_AnalysisComputer table 220 as the ANC_Probability value, providing a way of trending probability values for any particular computer. 
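The summation in Equation 4 can be sketched in Python as follows. The function name `computer_probability` and the dictionary of bundle values are hypothetical stand-ins for the BUA_BundleAnalysis rows of one analysis.

```python
def computer_probability(bundles_on_computer, bundle_probabilities):
    """Sum the bundle values for every bundle present on one computer (Equation 4).

    bundles_on_computer:  iterable of bundle ids found on the computer
    bundle_probabilities: dict of bundle id -> P_(b,a) for the same analysis
    """
    return sum(bundle_probabilities[bundle_id] for bundle_id in bundles_on_computer)


# Invented values for three bundles at one analysis.
bundle_probabilities = {"b1": 0.125, "b2": -0.025, "b3": 0.05}
score = computer_probability(["b1", "b2"], bundle_probabilities)
```

Bundles with negative values reduce the total, so a computer holding mostly bundles that are rare on once-infected machines ends up with a lower value.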

What is claimed:
 1. A method for calculating the probability of a cyber security incident for a computer within a collection of computers comprising:
    a. providing a means for compiling program file names, checksums, and locations within the file systems on each computer within said collection of computers, where the identity of each said program file is determined by a unique combination of filename and checksum,
    b. providing a means for organizing the said program files into Program File Bundles based upon similar distribution across said collection of computers,
    c. providing a means for calculating a probability value for each said Program File Bundle according to the distribution of said Program File Bundle relative to computers previously or currently involved in a security incident and computers never involved in a security incident and which are part of said collection of computers,
    d. providing a means for calculating said probability of a cyber security incident for said computer, as a sum of said probability values for each said Program File Bundle present on said computer.
 2. The method for calculating the probability of a cyber security incident of claim 1 wherein the function for calculating the probability value for any said Program File Bundle is: $P_{b,a} = {\left( {\frac{I_{b,a}}{C_{b,a}} - \frac{I_{a}}{C_{a}}} \right) \times \frac{C_{b,a}}{{\overset{\_}{C}}_{B,a}}}$ where P_(b,a) is the probability value for any said Program File Bundle b, at the time of an analysis a, where an analysis is defined as a point in time where said Program File Bundles are identified, recorded and their said probability values are calculated and recorded; I_(b,a) is the number of said computers previously or currently involved in a security incident and that have said Program File Bundle b at the time of said analysis a; C_(b,a) is the total number of computers with Program File Bundle b at the time of said analysis a; I_(a) is the total number of said computers previously or currently involved in a security incident at the time of analysis a; C_(a) is the total number of computers at the time of analysis a; and C̄_(B,a) is the average number of computers per said Program File Bundle, across all said Program File Bundles B at the time of analysis a.
 3. The method for calculating the probability of a cyber security incident of claim 2 wherein said means for organizing the identities of said program files into Program File Bundles based upon similar distribution across said collection of computers comprises:
    a. providing a means for dividing said computers into N cells, with one or more said computers in each said cell,
    b. providing a means for obtaining a collection of values {G_(nf)} which are the number of times each said program file f was found within said cell n, counting said program file no more than once per said computer even if it is found multiple times in the file system, and counting said program file once, even if it had been removed from said computer,
    c. providing a means for calculating a distance value d_(ab) = F₁({G_(nf_a)}, {G_(nf_b)}) between every two said program files, symbolized by f_(a) and f_(b), where said distance value is some function F₁ of said collection of {G_(nf)} values for each said program file,
    d. providing a means for compiling all said program files together into Program File Bundles, where each said Program File Bundle consists of said component program files for which the distance d_(ab) between any two is zero or near zero.
 4. The method for calculating the probability of a cyber security incident of claim 3 wherein the function F₁ is ${d_{ab} = \sqrt{\sum\limits_{n = 0}^{N}\left( {G_{{nf}_{a}} - G_{{nf}_{b}}} \right)^{2}}},$ and where N is the total number of computer cells.